ABSTRACT


Misconceptions have been an important area of study in 
STEM education towards improving our understanding of 
learners’ construction of knowledge. The advent of large- 
scale tutoring systems has given rise to an abundance of 
data in the form of learner question-answer logs in which 
signatures of misconceptions can be mined. In this work, we 
explore the extent to which collected expert misconception 
diagnoses can be generalized to held-out questions to add 
misconception semantics. We attempt this generalization 
by way of a question-answer neural embedding trained on 
chronological sequences of learner answers. As part of our 
study, we collect natural language misconception diagnoses 
from math educators for a sampling of student answers to 
questions within four topics on Khan Academy. Drawing 
inspiration from machine translation, we use a multinomial 
logistic regression model to explore how well the expert mis- 
conception semantics, in the form of bag-of-words vectors, 
can be mapped onto the learned embedding space and inter- 
polated. We evaluate the ability of the space to generalize 
expert diagnoses using three levels of cross-fold validation in 
which we measure the recall of predicted natural language di- 
agnoses across rater, topics, and questions. We find that the 
embedding provides generalization performance substantially 
beyond baseline approaches. 


1. INTRODUCTION 


The notion of mapping out abstract spaces of student learning
and development has a long history, with the Zone of
Proximal Development [23] serving as a canonical example 
of defining the area of topics a student could learn with help 
from peers and the topics beyond. Work in Educational Data 
Mining has explored mapping out learning spaces taking the 
form of tree structures [4] or concept nodes in a directed 
graph [11], often used to represent prerequisite relationships. 
Other work has mapped out progress points within a course 
and their relationship to classical psychometric measures of 
ability [1]. In this work, we build on the idea of conceiving a
space of learning as an embedding, or set of continuous vec- 
tors, with parts of the space indicative of different states of 
understanding and misconception [14]. We learn this embed- 
ding from sequences of millions of answers to exercises from a 
popular STEM tutoring system, then recruit qualified experts 
to diagnose a sampling of common wrong answers, providing 
natural language semantics to associate with question an- 
swers at their respective locations in the embedding. To test 
if the embedding generalizes these short form diagnoses, we 
use linear interpolation of the learned vector space to predict 
the words used in held-out diagnoses, holding out by expert, 
problem type, and question in cross-validation experiments. 
Successful predictive generalization in this task has implica- 
tions for surfacing automatically generated misconception 
hypotheses to both teachers and computer tutors. 


2. RELATED WORK 

The theory of mathematical misconceptions described by 
Piaget [16], and considered by Smith, diSessa, and Roschelle
[19], is one of continually developing partial understandings.
Analysis of learner responses, rather than only correctness, 
may reveal aspects of their understandings. In the age of 
big data and computation, several modern approaches have 
brought different perspectives to the analysis of misconcep- 
tions. Feldman et al. [5] generated plausible production 
rules that could have produced the common wrong answers 
observed in student responses to addition questions in 11 
elementary schools. In the vein of KC model or Q-matrix im- 
provement [22], Liu, Patel, & Koedinger [6] explored adding 
KCs symbolizing buggy production rules to problem steps 
whose correct answer could be arrived at in spite of applying 
the buggy production. They found that the inclusion of this 
item-level misconception tagging improved the overall fit of 
their AFM model and the validity of the learned individual 
student parameters. Most complementary to our work is the
work of Michalenko, Lan, & Baraniuk [8], who did not study 
misconceptions in common wrong answers, but rather miscon- 
ceptions found in the text of long open response text, using 
skip-grams and other embedding methods. Their approach is 
complementary to ours in that it cannot be applied to short, 
numeric answers in isolation. Conversely, our approach, which
extends the embedding context across questions, is driven 
by questions that generate common wrong answers across 
students, which would exclude direct applicability to long 
answer response text. 
2.1 Buggy Rules

In the cognitive theories underlying the design of intelligent 
tutoring systems [2], there are rules that produce correct and 
consistent answers, and efforts have been made towards cat- 
aloging collections of buggy rules that could instead produce 
incorrect answers. These buggy rules could represent miscon- 
ceptions that students often have during the learning process 
[3]. This large collection of buggy rules is often referred to 
as a bug catalog [20]. As a student moves through a problem 
set, the bug catalog enables tutoring systems to tag, track, 
and respond to a path of answers the student provides. 


Past research efforts to classify these buggy rules also include 
the manual labeling of misconceptions by experts [7], the ex- 
ploration of cluster relationships between the wrong answers 
[15], and approaches that take into account the frequency 
of student misconceptions [21]. These efforts lay the foun- 
dation for automated approaches which utilize these buggy 
rules to generate targeted guidance messages specific to each 
incorrect answer [18]. 


2.2 Use of Skip-grams 

Skip-gram models were originally applied to the embedding 
of words based on a large corpus of text (e.g. Wikipedia or 
a large archive of news articles). Once trained, the represen- 
tational (hidden) layer of these models was shown to encode 
distributed concepts in the form of syntactic (e.g., bee is 
to bees as goose is to geese) as well as semantic relationships
(e.g., Einstein is to scientist as Picasso is to painter)
[10]. While conventionally applied to language in its debut, 
skip-grams have been applied to non-linguistic data from 
education. University courses were embedded from sequences 
of enrollments [13] to find course similarities outside of what 
could be inferred from catalog descriptions. Questions within 
the ASSISTments tutoring platform were embedded based 
on sequences in which problems were answered in order to 
predict the skill of untagged questions [12]. Skip-grams and 
other embedding models have been applied to standard natu- 
ral language in educational contexts, such as the learning of 
vector representations of open response text and correlating 
vector representations with the presence or absence of hand 
coded misconceptions [8].


3. TUTOR DATA SET 


Our dataset of anonymized student answer logs comes from 
Khan Academy, an online STEM tutoring platform. As 
described in our previous work [14], Khan Academy catego- 
rizes student responses by exercise, a broad skill similar to 
those seen in ASSISTments Skill Builder sets; by problem 
type, a problem template; and finally by seed, one of two 
hundred values per problem type which uniquely identifies 
a template instantiation. Each log entry also contains an 
anonymous user ID and timestamp, which we use to group 
and chronologically sort student answers for model training. 


We used the same exercise selection process as in [14] to nar- 
row our focus to exercises with sufficient data and concerning 
topics that would likely surface interesting misconceptions 
for educators to analyze and describe. This involved consult- 
ing a subject matter expert in mathematical education [17] 
and verifying the correctness of the log entries by forming a 
sample set of questions and manually accessing their respec- 
tive web pages on Khan Academy. At the conclusion of this 


filtering process, we identified four suitable exercises to use 
in our experiments: 
1. “Surface Areas” (SA) 
2. “Slope from an equation in slope intercept form” (SESI) 
3. “Area of quadrilaterals and polygons” (AQP) 
4. “Adding and subtracting fractions” (ASF) 


Table 1 shows statistics for each exercise. 


                         | SA      | SESI    | AQP     | ASF
Problem Types            | 6       | 2       | 2       | 7
[row label illegible]    | 105,659 | 33,003  | 58,239  | 179,263
[row label illegible]    | 55,126  | 6,912   | 17,998  | 46,516
Total Incorrect Answers  | 619,045 | 112,390 | 298,356 | 873,916

[Cells that could not be aligned to a row: 20, 50, 40, 20-80. The row label "Unique Incorrect Answers" survives without an unambiguous set of values.]

Table 1: Descriptive statistics of exercises used to
train the skip-gram models.


A second dataset was collected as part of this study, which 
consisted of natural language diagnoses of common wrong 
answers from our chosen exercises. These diagnoses were 
written by mathematics educators, with each diagnosis ex- 
plaining the misconception that was potentially responsible 
for the incorrect answer. We collected misconception diagnosis
labels using an online survey platform. We describe the
collection of these data in Section 4.2. 


4. METHODOLOGY 


In this section, we describe the techniques employed to com- 
plete three primary methodological tasks: 


1. Generate learned question answer embeddings from 
student answer logs 


2. Generate bag-of-words representations of the semantic 
data contained in educator diagnoses of the miscon- 
ceptions associated with the incorrect student answers 
from (1.) 


3. Compute a model that generalizes semantic diagnoses of
wrong answers based on regression from the continuous
vectors of (1.) to the semantic representations from (2.)


Figure 1 depicts the full data processing and machine learning 
pipeline that we implemented to complete these tasks, using 
both the answer event logs and the misconception diagnoses 
as inputs and outputting natural language diagnoses for 
held-out question answers. 


4.1 Embedding Student Answers 

As described in Section 2, machine learning models origi- 
nally intended to model natural language have recently been 
applied to a number of other domains, including education. 
Motivated by the success of these efforts, we used a skip-gram 
neural network model to learn representations of student an- 
swers. A representation in our setting, or embedding, is a 
vector in a high-dimensional space that is learned by a skip-gram
model. We use the same strategy as in [14] to encode
each student answer in a token containing its seed and the
frequency rank of the student’s response within that seed.
For example, if a student were to answer a question generated
from seed x01b with the most frequently occurring incorrect
answer to that question, their answer would be represented
by the token x01b_1.


Figure 1: The pipeline used to model student answers,
teacher diagnoses, and their correlation.
(Diagram labels: Student Answers; Misconception Analysis;
Pre-processing; Bag of Words; Logistic Regression;
Predicted Misconception Tags.)
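The token encoding described above can be illustrated with a minimal Python sketch. The function name `build_tokens`, the `(seed, answer)` input layout, and the alphabetical tie-breaking rule are assumptions made for illustration and are not taken from the original pipeline.

```python
from collections import Counter, defaultdict


def build_tokens(log_entries):
    """Map each (seed, incorrect answer) pair to a token '<seed>_<rank>'.

    log_entries: iterable of (seed, answer) tuples for incorrect answers.
    Rank 1 is the most frequently submitted incorrect answer for a seed.
    """
    counts = defaultdict(Counter)
    for seed, answer in log_entries:
        counts[seed][answer] += 1

    token_of = {}
    for seed, counter in counts.items():
        # Sort by descending frequency; ties are broken by answer text so
        # the ranking is deterministic (the tie rule is an assumption).
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        for rank, (answer, _) in enumerate(ranked, start=1):
            token_of[(seed, answer)] = f"{seed}_{rank}"
    return token_of


# Toy log: seed x01b has two distinct wrong answers, s03c has one.
logs = [("x01b", "7"), ("x01b", "7"), ("x01b", "-7"), ("s03c", "1/2")]
tokens = build_tokens(logs)
print(tokens[("x01b", "7")])
```

A student's chronologically sorted answers would then be mapped through `tokens` to form one input sequence for the skip-gram model.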


A skip-gram model is a two-layer neural network (one hidden 
layer) that analyzes a corpus of token sequences to learn 
continuous vector representations for each of these tokens. 
Vectors are trained with the goal of predicting the context 
of each token. For example, x01b_2 would have s03c_4 in 
its context if students often provide incorrect responses to 
those questions in succession. The loss function (Eq. 1) 
for the training process, described in [10], seeks to optimize 
the log-likelihood of the tokens in context given a specific 
input token. S represents the set of input sequences for 
the model, each corresponding to a student’s sequence of 
responses to a given exercise. c represents the window size, 
a hyperparameter of the model that specifies the width of 
a token’s context when learning its representation, and T 
represents the number of tokens in sequence s. 


$$\mathcal{L} = \sum_{s \in S} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t) \qquad (1)$$

We use the negative sampling variant for training the skip-grams
as introduced in [10], which replaces the final term of
the form $\log P(w_O \mid w_I)$ in Equation 1 with

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right] \qquad (2)$$

Above, $\sigma$ represents the sigmoid function. Roughly, this
formulation seeks to include the weights of k randomly chosen
negative samples, i.e., tokens $w_i$ that do not occur within
the context of the target token $w_O$, in the backpropagation
process. Unlike the original hierarchical softmax formulation,
negative sampling has the advantage of only adjusting pairs
of weights in the underlying network during backpropagation.
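To make the negative sampling term concrete, the following sketch evaluates it for a single (input, output) token pair using plain Python lists as vectors. It approximates the expectation over the noise distribution by a sum over k already-drawn negative vectors, as is standard practice; the function names and toy vectors are hypothetical and the code is not the training implementation used in this work.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_term(v_in, v_out, negative_vecs):
    """log sigma(v_out . v_in) + sum_i log sigma(-v_neg_i . v_in).

    v_in:          input-side vector of the target token
    v_out:         output-side vector of an observed context token
    negative_vecs: output-side vectors of k sampled non-context tokens
    """
    positive = math.log(sigmoid(dot(v_out, v_in)))
    negative = sum(math.log(sigmoid(-dot(v_neg, v_in)))
                   for v_neg in negative_vecs)
    return positive + negative


# Toy three-dimensional vectors with k = 2 negative samples.
v_in = [0.5, -0.2, 0.1]
v_out = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.4, -0.5]]
value = negative_sampling_term(v_in, v_out, negatives)
print(round(value, 4))
```

Training increases this quantity by pulling the context vector toward the input vector and pushing the sampled negatives away, which is why only the vectors involved in the pair and its k negatives receive updates.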


4.2 Collecting Teacher Diagnoses 

We collected expert-generated semantic misconception diag- 
nosis data through a questionnaire designed and run on the 
Qualtrics platform. Qualtrics recruited survey participants 
and compensated them on our behalf at a rate of $30 per 
participant. We had Qualtrics recruit participants who: 


• Are working as a mathematics educator for students
  who are in grades 5-12 or undergraduates
• Have at least two years of prior teaching experience


The number of problem types and seeds within each exercise 
included in the survey is shown in Table 2. For each seed, 
we formed a batch of the five most frequently submitted 
incorrect answers to present to survey participants. 


[Table body not recoverable. Surviving cells: "# Prob. Types" column: 2, 5, 3; further values: 17, 18, 6, 18.]

Table 2: Wrong answer exercises, problem types,
and seeds for which expert diagnoses were sought


Each survey participant was provided with initial instruc- 
tions, excerpted in Figure 2. Next, they were shown three 
randomly selected answer batches. For each batch, the survey 
respondent was presented with a screenshot of the original 
question as it appeared on Khan Academy, the text of the 
five incorrect student answers, and text boxes to write a brief 
misconception diagnosis for each answer. An example Khan 
Academy question and the associated diagnoses we collected 
are shown in Figure 3. 


Respond with a general label-phrase that describes the
most likely error or misconception related to the incorrect
answer.
• Avoid references to specifics of the question (e.g., do
  not say “additive inverse is 4, not −4”).
• Your label or phrase should be general enough such that
  it could potentially be applied to other incorrect answers.
  Therefore, you may duplicate labels and phrases
  as you see appropriate.
• Avoid abbreviations (e.g., use “y intercept” instead of
  “yint”).

Example Responses
Question: Solve 3x − 4 = 20
Student Answer: 5 1/3
Example Label-Phrase: opposite of additive inverse

Figure 2: An excerpt of the instructions presented 
to survey participants providing expert diagnoses 


Answer | Misconception Diagnosis
17/20  | Added 5 + 12 instead of 5 − 12.
−17/20 | Added 5 + 12 instead of 5 − 12. And [illegible] incorrect sign.

Figure 3: A sample Khan Academy question and
corresponding misconception labels


Alternatively, we could have asked experts to create misconception
labels out of terms drawn from a fixed taxonomy,
rather than to compose these labels from scratch and without
explicit guidance. However, the terms in this taxonomy
would inevitably reflect our own biases and assumptions and
may prevent experts from accurately describing their observations.
Instead, we allowed a broad vernacular, but also
asked experts to review their labels at the end of the survey
to encourage them to be consistent in their language.


We found that the quality of survey responses varied dra- 
matically within our dataset and developed a procedure to 
identify and retain only misconception labels that were suit- 
able for further analysis. We manually excluded all responses 
where an attempt at a label was clearly not present, such 
as “idk.” Next, we retained diagnoses only from experts who 
wrote labels with an average length of 20 characters or more. 
This process left us with 570 unique diagnoses covering 14 
of the 15 problem types and 64 of the 89 seeds. 


4.3 Processing Teacher Diagnoses 

After collecting expert misconception diagnoses through the 
survey platform, we performed data pre-processing to even- 
tually represent each label in bag-of-words form. Many 
diagnoses contained references to specific numbers found in 
the instantiation of the question. We chose not to give every 
numerical quantity its own token but rather to replace each 
contiguous mathematical expression with the token numN,
representing the N-th contiguous expression appearing in the
diagnoses for each seed. Numbers used to describe general 
misconception rules, e.g. the factor of 1/2 used in computing 
the area of a triangle, were hand-identified and allowed to 
be represented in original form. This helps to prevent our 
models from incorrectly identifying correlations that are co- 
incidental (two question instances happen to use the same 
random quantity) rather than structural. 


Next, we stripped punctuation, removed stopwords, and per- 
formed word stemming. Finally, we manually removed some 
of the most common tokens that we deemed uninformative 
and which could have resulted in trivially easy prediction 
due to their frequency, such as student, tried, and used. 
Each processed expert diagnosis is represented as a bag-of- 
words vector, where an element of the vector indicates the 
number of occurrences of a term from a global vocabulary. 
Where we had multiple expert labels available for a single 
incorrect student answer, we concatenated the labels
and constructed a bag-of-words representation of the result. 
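The pre-processing steps above can be sketched as follows. This is an illustrative approximation only: the stopword list is a small hypothetical subset, stemming is omitted to keep the example free of third-party dependencies, the regular expression for "contiguous mathematical expression" is an assumption, and numbering distinct expressions by order of first appearance is one possible reading of the numN scheme. The hand-identification of general-rule numbers (such as 1/2) is not modeled.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "instead", "is", "in"}  # illustrative subset
UNINFORMATIVE = {"student", "tried", "used"}  # examples named in the text
# Assumed pattern: digits optionally joined by arithmetic operators.
NUMBER_PATTERN = re.compile(r"\d+(?:\s*[-+*/.]\s*\d+)*")


def replace_numbers(diagnoses):
    """Replace numeric expressions in one seed's diagnoses with num1, num2, ..."""
    ids = {}

    def substitute(match):
        expr = re.sub(r"\s+", "", match.group(0))
        if expr not in ids:
            ids[expr] = f"num{len(ids) + 1}"
        return ids[expr]

    return [NUMBER_PATTERN.sub(substitute, d) for d in diagnoses]


def tokenize(text):
    """Lowercase, strip punctuation, and drop stopwords and uninformative terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS and w not in UNINFORMATIVE]


def bag_of_words(tokens, vocabulary):
    """Count occurrences of each vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


raw = ["Added 5 + 12 instead of 5 - 12", "Student used 5 + 12"]
cleaned = replace_numbers(raw)
token_lists = [tokenize(d) for d in cleaned]
vocabulary = sorted({t for tokens in token_lists for t in tokens})
vectors = [bag_of_words(tokens, vocabulary) for tokens in token_lists]
print(cleaned, vocabulary, vectors)
```

Because "5 + 12" receives the same placeholder in both diagnoses, the two labels share a term without the model ever seeing the literal quantity.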


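Crossfold Type | Evaluator | Problem Type | Seed
Folds          | 19        | 14           | 64
Data Points    | 302       | 296          | 314
Evaluators     | 18        | 19           | 19
Exercises      | 4         | 4            | 4
Prob Types     | 14        | 13           | 14
Seeds          | 61        | 59           | [illegible]

[A second block with the same row labels (Data Points, Evaluators, Exercises, Prob Types, Seeds) has no legible values.]

Table 3: Statistics for different cross validation
schemes. Entries are rounded averages across folds.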


4.4 Mapping Answer Vectors to Diagnoses 
With both embeddings of student responses and expert- 
generated diagnoses in hand, we could explore the extent to 
which the continuous vector representation of an incorrect an- 
swer is related to a semantic description of the misconception 
underlying that answer. We trained a multinomial logistic 
regression model to calibrate this correspondence that uses a 
vector embedding of an incorrect student answer to predict 
the words in the expert’s diagnosis of that answer. The 
regression takes as input an m-vector representing a student 
answer, where m is the dimensionality of the skip-gram embedding
space (a hyperparameter of the model). The model
produces as output an n-vector, where n is the size of the 
teacher misconception diagnosis vocabulary. Because of the 
regression’s use of softmax, this n-vector forms a probability 
distribution across all terms used in the teacher diagnoses. 
The i-th element of the vector expresses the predicted probability
that the i-th term of the diagnosis vocabulary applies
to the student answer. 
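As a minimal sketch of the forward pass of such a regression, the code below maps an m-dimensional answer embedding to a probability distribution over n vocabulary terms via a linear layer followed by softmax. The weight matrix, bias values, and function names are hypothetical placeholders; fitting the parameters is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_term_distribution(embedding, weights, biases):
    """Return n probabilities, one per vocabulary term.

    embedding: list of m floats (answer vector)
    weights:   n rows, each a list of m floats
    biases:    list of n floats
    """
    scores = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy setting with m = 2 embedding dimensions and n = 3 vocabulary terms.
embedding = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
biases = [0.0, 0.0, 0.0]
probs = predict_term_distribution(embedding, weights, biases)
print([round(p, 3) for p in probs])
```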


5. RESULTS 


Here, we describe our results and methodology for evaluating 
the representations produced by a skip-gram model by using 
logistic regression and the expert-generated misconception 
diagnoses. We performed a search over the hyperparameters 
of the skip-gram algorithm and then compared the predictions 
generated by our machine learning pipeline to two baselines. 


5.1 Skip-Gram Model Evaluation 


Recall from Figure 1 that we use logistic regression to train 
a model identifying correlations between embeddings of stu- 
dent answers and semantic explanations of the underlying 
misconceptions responsible for incorrect answers. The model 
surfaces correlations by taking a vector representation as 
input and producing a probability distribution over the vo- 
cabulary of terms used by educators in their misconception 
diagnoses as output. 


Using the semantic data collected from educators as ground 
truth, we evaluated the insights generated through logistic 
regression when using vectors produced by different skip- 
gram models as input. We performed a standard leave- 
one-out cross-validation (CV) procedure on the educator 
data. We then evaluated the quality of a model’s predicted 
misconception tags for student answers in the remaining fold 
using recall at N, where the value of N for each prediction 
is equal to the number of terms used in the original expert
label for the relevant incorrect answer. This is defined as: 


$$R = \frac{|T \cap T_N|}{|T|} \qquad (3)$$

where T is the set of terms contained in an educator’s misconception
diagnosis for an answer, $T_N$ is the set of terms
corresponding to the N largest entries in the probability
distribution produced by the logistic regression when given
an embedding of the answer as input, and N = |T|.
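Recall at N as defined in Equation 3 can be computed with a few lines of Python. The function name and the toy vocabulary below are illustrative assumptions; ties between equal probabilities are resolved by vocabulary order, which the text does not specify.

```python
def recall_at_n(true_terms, probabilities, vocabulary):
    """Fraction of the expert's terms found among the N most probable terms,
    where N is the number of distinct terms in the expert diagnosis."""
    true_set = set(true_terms)
    n = len(true_set)
    if n == 0:
        raise ValueError("the expert diagnosis must contain at least one term")
    # Indices of vocabulary terms sorted by descending predicted probability.
    ranked = sorted(range(len(vocabulary)),
                    key=lambda i: probabilities[i], reverse=True)
    top_n = {vocabulary[i] for i in ranked[:n]}
    return len(true_set & top_n) / n


vocabulary = ["add", "sign", "invert", "area"]
probabilities = [0.4, 0.1, 0.3, 0.2]
score = recall_at_n({"add", "sign"}, probabilities, vocabulary)
print(score)  # 0.5: "add" is in the top two predictions, "sign" is not
```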


We performed three leave-one-out cross-validations using 
each of the following to determine the fold segmentation: 
1. Evaluator: The ID of the educator who produced the 
misconception diagnosis. 


2. Problem Type: The ID of the template used to generate 
a question. 


3. Seed: The unique identifier of an instantiated question. 


Descriptive statistics concerning the train and test splits for 
each scheme are summarized in Table 3. 
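The three fold segmentations share one mechanism: all data points with the same value of a grouping key form a held-out fold. A hedged sketch follows, assuming each labeled data point is a dictionary with hypothetical keys `evaluator`, `problem_type`, and `seed`.

```python
from collections import defaultdict


def leave_one_group_out(records, key):
    """Yield (held_out_value, train_indices, test_indices) for each distinct
    value of records[i][key]; each value is held out exactly once."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record[key]].append(index)
    for held_out, test_idx in groups.items():
        held = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held]
        yield held_out, train_idx, test_idx


# Toy records; identifiers are made up for illustration.
records = [
    {"evaluator": "e1", "problem_type": "p1", "seed": "x01b"},
    {"evaluator": "e1", "problem_type": "p2", "seed": "s03c"},
    {"evaluator": "e2", "problem_type": "p1", "seed": "x01b"},
    {"evaluator": "e3", "problem_type": "p1", "seed": "x02a"},
]
for key in ("evaluator", "problem_type", "seed"):
    folds = list(leave_one_group_out(records, key))
    print(key, len(folds))
```

Holding out by seed leaves the most training data per fold, while holding out by evaluator or problem type removes larger, more heterogeneous groups at once.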


5.2 Results of Hyperparameter Search 

We trained over 750 skip-gram models using different combi- 
nations of hyperparameters and then ran each model through 
the cross-validation procedure described above. The hyper- 
parameters we varied were: 


1. Vector Size: The number of elements in the vector 
representations learned by the skip-gram model 


2. Window Size: The width of each token’s context, i.e., 
the number of surrounding tokens to consider in the 
loss function defined in Equation 1. 


3. Min Count: The minimum number of times a token 
occurs in the training set to be included in the model. 


4. Training Epochs 


Figure 4: Distribution of average recall under the
different cross-validation schemes.
(Histogram; x-axis: Avg. Recall at N; y-axis: Count;
series: Evaluator, Problem Type, Seed.)


Figure 4 shows the distribution of recall results achieved 
by all the models under each scheme. We also examined 
the distribution of hyperparameters among the ten models 
that achieved the highest average recall at N under each 
cross-validation type. We found that this metric was not 
sensitive to the hyperparameter values among the top ten 
models for all CV types. Within each CV type, all models 
produced scores within 0.0z of one another. Table 4 shows the 


hyperparameters that produced the best performing models, 
measured by average recall, for all CV types. 


            | Evaluator | Problem Type             | Seed
Vector Size | 67        | 100                      | 100
Window Size | 15        | 0 [digit possibly lost]  | 8

Table 4: The best skip-gram hyperparameter combinations
under each cross validation scheme.


5.3 Diagnosis Generalization by Best Models 
We compared the recall achieved by predicting the words 
in the diagnoses using the best skip-gram embeddings and 
logistic regression to the recall achieved by two baseline 
prediction schemes. For each incorrect student answer, all 
of the methods predict N terms, where N is the number 
of terms contained in the original expert diagnosis of the 
underlying misconception for that answer. This ensures we 
can fairly measure each prediction scheme by recall at N. 
The two baselines were: 


1. Random: Generate a random sample of N terms from 
the vocabulary formed by the expert misconception 
diagnoses in the training set. 

2. Frequency: Predict the N terms that appear most 
frequently in the diagnoses from the training set. 
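The two baselines can be sketched as below, assuming each training diagnosis is already a list of processed terms. Function names and the toy training set are hypothetical; the random baseline samples distinct terms without replacement, which is an assumption about a detail the text leaves open.

```python
import random
from collections import Counter


def random_baseline(train_diagnoses, n, rng):
    """Sample n distinct terms uniformly from the training vocabulary."""
    vocabulary = sorted({term for d in train_diagnoses for term in d})
    return rng.sample(vocabulary, min(n, len(vocabulary)))


def frequency_baseline(train_diagnoses, n):
    """Return the n terms that occur most often in the training diagnoses."""
    counts = Counter(term for d in train_diagnoses for term in d)
    return [term for term, _ in counts.most_common(n)]


train = [["add", "sign"], ["add", "invert"], ["add", "sign", "area"]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_terms = random_baseline(train, 2, rng)
frequent_terms = frequency_baseline(train, 2)
print(random_terms, frequent_terms)
```

Both baselines ignore the answer being diagnosed, so any recall they achieve reflects only how concentrated the diagnosis vocabulary is.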


Figure 5: Average recall achieved by different prediction
schemes for each cross-validation type.
(Bar chart; x-axis: CV Type; y-axis: Avg. Recall at N;
series: Random, Frequency, Skip-Gram & Log. Reg.)


The average recall at N achieved by the predictions generated 
through each baseline scheme, as well as that of our own ap- 
proach, is shown in Figure 5. As expected, a frequency-based 
approach outperforms a random approach in all three cross-validation
types. In addition, the embedding-based approach
outperforms the frequency-based approach in all
three cases, nearly doubling its recall. The results show that between
18% and 27% of words in held-out diagnoses were recovered. 
This improvement over baseline suggests a moderate corre- 
spondence between the regularities learned in the embedding 
and semantics used to describe misconceptions. 


Recall increased with the size of the training set, with Seed 
having the largest training set and Evaluator having the 
smallest. Other factors may also contribute to these results. 
First, we chose Khan Academy exercises spanning a diverse 
selection of mathematical concepts, and the diagnoses for 
misconceptions that arise in one domain (e.g., fractions) may
use very different diagnosis terms than the terms used for mis- 
conceptions in another domain (e.g., surface area). Therefore, 
there are likely cases where the training set doesn’t contain 
the proper terms to express the misconception diagnoses 
in the test set. Moreover, different educators used different 
taxonomies and terms when constructing their misconception 
diagnoses, which means a model may not be able to accurately
predict the diagnoses provided by an educator who
is not well represented in the training data set, which appears
to be the situation that arises in Evaluator cross-validation. 


6. DISCUSSION 

Should the 27% recall that we achieved in predicting the 
terms of held out misconception diagnoses be considered a 
good score? There are no prior results in this particular
area to serve as a state of the art for comparison. However,
this technique of linearly translating from one space (an- 
swer embedding) to another (diagnosis bag-of-words) is akin 
to machine translation from one language’s embedding to 
another. Looking at the accuracy reported in the original 
linear machine translation paper [9], a translation accuracy 
of 10% was achieved from English to Vietnamese and
24% in the other direction. Therefore, we could consider
27% a comparable score to past NLP translation benchmarks 
and a performance level that may produce diagnoses that 
expert teachers could consider and potentially act on. 


A limitation of our approach was that, as discussed in Section 
4.2, our survey allowed experts to write open-ended miscon- 
ception diagnoses which resulted in low frequency of some 
words and thus a more challenging downstream prediction 
task. A future study could restrict the terms available for 
use in expert labels or have them simultaneously negotiate 
a shared taxonomy. Finally, the student response sequences 
used as input for the skip-gram models were partitioned by 
Khan Academy exercise because we wanted to focus on a
limited number of topic areas. This may have led to missing
misconception signatures that manifest or generalize across 
exercises. 


7. ACKNOWLEDGEMENTS 


We thank Khan Academy for sharing anonymized exercise 
data. This work was supported, in part, by a grant from the 
National Science Foundation (#1547055). 


8. REFERENCES 
[1] R. Almond, I. Goldin, Y. Guo, and N. Wang. Vertical
and stationary scales for progress maps. In EDM, 2014.
[2] J. R. Anderson. Rules of the mind. Psychology Press,
2014.
[3] J. S. Brown and K. VanLehn. Repair theory: A
generative theory of bugs in procedural skills. Cognitive
science, 4(4):379-426, 1980.
[4] M. Eagle and T. Barnes. Exploring differences in
problem solving with data-driven approach maps. In
Educational Data Mining, 2014.
[5] M. Q. Feldman, J. Y. Cho, M. Ong, S. Gulwani,
Z. Popović, and E. Andersen. Automatic diagnosis of
students’ misconceptions in k-8 mathematics. In CHI
’18, page 264. ACM, 2018.
[6] R. Liu, R. Patel, and K. R. Koedinger. Modeling
common misconceptions in learning process data. In
Proceedings of the Sixth Intl. Conf. on Learning
Analytics & Knowledge, pages 369-377. ACM, 2016.
[7] T. S. McTavish and J. A. Larusson. Labeling
mathematical errors to reveal cognitive states. In
European Conference on Technology Enhanced
Learning, pages 446-451. Springer, 2014.
[8] J. J. Michalenko, A. S. Lan, and R. G. Baraniuk.
Data-mining textual responses to uncover
misconception patterns. In Proceedings of the 10th
Conference on Educational Data Mining, pages
208-2013, 2017.
[9] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting
similarities among languages for machine translation.
arXiv preprint arXiv:1309.4168, 2013.
[10] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and
J. Dean. Distributed representations of words and
phrases and their compositionality. In Advances in
neural information processing systems, pages
3111-3119, 2013.
[11] A. Muehling. Concept landscapes-a new way of using
concept maps. Journal of Educational Data Mining,
9(2):1-30, 2017.
[12] Z. A. Pardos and A. Dadu. Imputing kcs with
representations of problem content and context. In
UMAP ’17, pages 148-155. ACM, 2017.
[13] Z. A. Pardos, Z. Fan, and W. Jiang. Connectionist
recommendation in the wild: On the utility and
scrutability of neural networks for personalized course
guidance. User Modeling and User-Adapted Interaction,
2019.
[14] Z. A. Pardos, S. Farrar, J. Kolb, G. X. Peh, and J. H.
Lee. Distributed Representation of Misconceptions. In
J. Kay and R. Luckin, editors, Proceedings of the 13th
International Conference of the Learning Sciences
(ICLS), pages 1791-1798, London, UK, 2018.
[15] R. Pelánek and J. Řihák. Properties and applications of
wrong answers in online educational systems. In EDM,
pages 466-471, 2016.
[16] J. Piaget. The child’s concept of number, 1952.
[17] A. Schoenfeld. Personal Communication.
[18] D. Selent and N. Heffernan. Reducing student hint use
by creating buggy messages from machine learned
incorrect processes. In Intl. Conf. on Intelligent
Tutoring Systems, pages 674-675. Springer, 2014.
[19] J. P. Smith, A. A. diSessa, and J. Roschelle.
Misconceptions reconceived: a constructivist analysis of
knowledge in transition. The journal of the learning
sciences, 3(2):115-163, 1994.
[20] W. L. Johnson and E. Soloway. Intention-based diagnosis of
programming errors. In Proceedings of the 5th National
Conference on Artificial Intelligence, Austin, TX, pages
162-168, 1984.
[21] M. Straatemeier et al. Math garden: A new educational
and scientific instrument. Education, 57:1813-1824,
2014.
[22] K. K. Tatsuoka. A probabilistic model for diagnosing
misconceptions by the pattern classification approach.
Journal of Educational Statistics, 10(1):55-73, 1985.
[23] L. S. Vygotsky. Mind in society: The development of
higher psychological processes. Harvard university press,
1980. Original Manuscripts ca. 1930-1934.


Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019)


